Self-organising Data Mining

Many different data mining tools are available, and many papers have been published describing data mining techniques. In our view, the most important requirement for a more sophisticated data mining technique is that it limit the user's involvement in the entire data mining process to the inclusion of well-known a priori knowledge. This makes the process more automated and more objective. Most users are primarily interested in generating useful and valid models without needing extensive knowledge of mathematical, cybernetic and statistical techniques, and without the time demanded by complex dialog-driven modelling tools. Soft computing, i.e., fuzzy modelling, Neural Networks, Genetic Algorithms and other methods of automatic model generation, is a way to mine data by generating mathematical models from empirical data more or less automatically. In recent years there has been much publicity about the ability of Artificial Neural Networks to learn and to generalise, despite persistent problems in the design, development and application of Neural Networks.
Self-organising data mining creates models of optimal complexity systematically and autonomously by employing both parameter and structure identification. A model of optimal complexity is one that optimally balances model quality on a given learning data set ("closeness of fit") against its generalisation power on new, previously unseen data, with respect to the data's noise level and the modelling task (prediction, classification, etc.). It thus solves the basic problem of experimental systems analysis: systematically avoiding overfitted models on the basis of the data's information alone. This makes self-organising data mining a highly automated, fast and efficient supplement and alternative to other data mining methods. The differences between Neural Networks and this new approach centre on Statistical Learning Networks and induction. The first Statistical Learning Network algorithm of this type, the Group Method of Data Handling (GMDH), was developed by A.G. Ivakhnenko in 1967. Considerable improvements were introduced in the 1970s and 1980s by versions of the Polynomial Network Training algorithm (PNETTR) by Barron and the Algorithm for Synthesis of Polynomial Networks (ASPN) by Elder, as Adaptive Learning Networks and GMDH flowed together. Further enhancements of the GMDH algorithm have been realised in the KnowledgeMiner software described in, and enclosed with, this book.
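To illustrate this selection principle, the following minimal Python sketch (illustrative names only, not code from KnowledgeMiner) fits polynomial candidate models of increasing degree on a learning subsample and selects the degree with the smallest error on a separate checking subsample, a simple external selection criterion:

import numpy as np

def select_optimal_complexity(x, y, max_degree=8):
    # Split the sample: every third point forms the checking set, the rest
    # the learning set (parameter estimation uses the learning set only).
    mask = np.arange(len(x)) % 3 == 0
    x_check, y_check = x[mask], y[mask]
    x_learn, y_learn = x[~mask], y[~mask]
    best = None
    for degree in range(1, max_degree + 1):            # structure candidates
        coeffs = np.polyfit(x_learn, y_learn, degree)  # parameter identification
        mse = np.mean((np.polyval(coeffs, x_check) - y_check) ** 2)
        if best is None or mse < best[1]:              # external criterion decides
            best = (degree, mse, coeffs)
    return best

# Noisy quadratic test data: the checking error, not the learning error,
# keeps the selected polynomial degree close to the true one.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.3, x.size)
degree, mse, _ = select_optimal_complexity(x, y)
print(f"selected degree: {degree}, checking MSE: {mse:.4f}")

The learning error alone always decreases with growing complexity; only the error on data not used for fitting can indicate where additional complexity merely models noise.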
This book provides a thorough introduction to self-organising data mining technologies for business executives, decision makers and specialists involved in developing Executive Information Systems (EIS) or in modelling, data mining or knowledge discovery projects. It is a book for working professionals in many fields of decision making: economics (banking, finance, marketing), business-oriented computer science, ecology, medicine and biology, sociology, the engineering sciences and all other fields concerned with modelling ill-defined systems. Each chapter includes practical examples and a reference list for further reading. The accompanying diskette/internet download contains the KnowledgeMiner demo version and several executable examples. This book offers a comprehensive view of all major issues related to self-organising data mining and its practical application to solving real-world problems. It not only gives an introduction to self-organising data mining, but also answers the questions that arise when applying it in practice.
Why Data Mining is Needed

Models make it possible to describe, analyse and predict the behaviour of systems, and thus to support decisions.
Therefore, mathematical modelling forms the core of almost all decision support systems. Models can be derived from existing theory (the theory-driven approach, or theoretical systems analysis) and/or from data (the data-driven approach, or experimental systems analysis).

a. Theory-driven approach

For complex ill-defined systems, such as economic, ecological, social and biological systems, we have insufficient a priori knowledge about the relevant theory of the system under study. Theory-driven modelling is considerably hampered by the fact that the modeller often has to know things about the system that are generally impossible to find out. This concerns uncertain a priori information regarding the selection of the model structure (factors of influence and functional relations) as well as insufficient knowledge about interference factors (actual interference factors, and factors of influence that cannot be measured). Accordingly, the insufficient a priori information concerns above all the selection of the model structure and the treatment of unknown or unmeasurable interference factors.
In order to overcome these problems and to deal with ill-defined systems, and in particular with insufficient a priori knowledge, ways must be found, with the help of emergent information engineering, to shorten the time- and resource-intensive model formation process that precedes initial task solving. Computer-aided design of mathematical models may soon prove highly valuable in bridging this gap.

b. Data-driven approach

Modern information technologies deliver a flood of data, and the question is how to leverage it. Commonly, statistically based principles are used for model formation, but they always require a priori knowledge about the structure of the mathematical model. In addition to the epistemological problems of commonly used statistical principles of model formation, several methodological problems arise from the insufficiency of a priori information: the indeterminacy of the starting position, marked by the subjectivity and incompleteness of the theoretical knowledge and by an insufficient data basis. Knowledge discovery from data, and specifically data mining techniques and tools, can assist humans in analysing the mountains of data and in turning the information they contain into successful decisions. Data mining comprises not a single analytical technique but many methods and techniques, depending on the nature of the enquiry: data visualisation, tree-based methods and methods of mathematical statistics, as well as knowledge extraction from data by self-organising modelling. Data mining is an interactive and iterative process of numerous subtasks and decisions, such as data selection and pre-processing, choice and application of data mining algorithms, and analysis of the extracted knowledge. Most important for a more sophisticated data mining application is to limit the involvement of users in the overall process to the inclusion of existing a priori knowledge, while making the process more automated and more objective. Automatic model generation such as GMDH, Analog Complexing and Fuzzy Rule Induction meets these demands and sometimes provides the only way to generate models of ill-defined problems.
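To make the idea of automatic model generation concrete, here is a minimal sketch of one GMDH-style layer, assuming a quadratic transfer polynomial in two inputs and selection by error on a separate checking set. It is a conceptual illustration under these assumptions; the function and variable names are hypothetical and not taken from KnowledgeMiner or any specific GMDH variant described in this book:

from itertools import combinations
import numpy as np

def gmdh_layer(X_learn, y_learn, X_check, y_check, keep=4):
    """Fit a quadratic polynomial for every pair of inputs on the learning
    set, rank candidates by their error on the checking set (the external
    criterion), and return the 'keep' best candidate outputs."""
    candidates = []
    for i, j in combinations(range(X_learn.shape[1]), 2):
        def design(X, i=i, j=j):
            xi, xj = X[:, i], X[:, j]
            return np.column_stack([np.ones_like(xi), xi, xj,
                                    xi * xj, xi**2, xj**2])
        coeffs, *_ = np.linalg.lstsq(design(X_learn), y_learn, rcond=None)
        err = np.mean((design(X_check) @ coeffs - y_check) ** 2)
        candidates.append((err, design(X_learn) @ coeffs, design(X_check) @ coeffs))
    candidates.sort(key=lambda c: c[0])
    best = candidates[:keep]
    # Outputs of the surviving candidates become inputs of the next layer.
    Z_learn = np.column_stack([c[1] for c in best])
    Z_check = np.column_stack([c[2] for c in best])
    return best[0][0], Z_learn, Z_check

# Usage on synthetic data with a product term and one linear term:
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 5))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0.0, 0.1, 90)
err, Z_learn, Z_check = gmdh_layer(X[:60], y[:60], X[60:], y[60:])
print(f"best checking MSE after one layer: {err:.4f}")

Stacking such layers, feeding the surviving outputs forward as new inputs and stopping once the best checking error no longer improves, yields a network whose structure is determined by the data rather than by the user.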